Table of Contents:

1. Introduction

Employee turnover (also known as "employee churn") is a costly problem for companies. A study by the Center for American Progress found that companies typically pay about one-fifth of an employee's salary to replace that employee, and the cost can increase significantly when executives or the highest-paid employees must be replaced. The cost comes from the time spent interviewing and finding a replacement, sign-on bonuses, and the loss of productivity during the several months it takes a new employee to get accustomed to the role.

Understanding why and when employees are most likely to leave can lead to actions to improve employee retention, as well as the ability to plan new hiring in advance. I will be using a step-by-step, systematic approach that could be applied to a variety of ML problems. This project falls under what is commonly known as "HR Analytics" or "People Analytics".

We'll work on questions like: What is the likelihood of an active employee leaving the company? What are the key indicators of an employee leaving the company? What policies or strategies can be adopted based on the results to improve employee retention?

Given that we have data on former employees, this is a standard supervised classification problem where the label is a binary variable: 0 (active employee) or 1 (former employee). The model's output, then, is the estimated probability of an employee leaving the company.

In this case study, an HR dataset was sourced from IBM HR Analytics Employee Attrition & Performance, which contains records for 1,470 employees along with various attributes for each. I will use this dataset to predict whether employees are going to quit by understanding the main drivers of employee churn.

2. Exploring the data

2.1 Importing libraries

2.2. Auxiliary Functions

2.3. First look at the data

As we can see, there are some features with low variance (quasi-constant features); let's drop them, hardcoding the list so that the constant categorical features are included as well. We can also note that many variables are skewed. We may have to transform them later.
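A minimal sketch of the dropping step on a toy DataFrame (in the real IBM dataset, columns such as EmployeeCount, Over18, and StandardHours are constant). Using nunique rather than scikit-learn's VarianceThreshold lets us catch constant categorical columns too:

```python
import pandas as pd

# Toy frame standing in for the IBM dataset; these three columns
# really are constant in the original data.
df = pd.DataFrame({
    "EmployeeCount": [1, 1, 1, 1],
    "Over18": ["Y", "Y", "Y", "Y"],
    "StandardHours": [80, 80, 80, 80],
    "Age": [29, 41, 35, 50],
})

# Drop any column with a single unique value (works for categoricals,
# unlike sklearn's VarianceThreshold, which is numeric-only).
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=constant_cols)
print(sorted(constant_cols))  # ['EmployeeCount', 'Over18', 'StandardHours']
```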

2.4. Deeper look at the relationships: Features x Leavers

2.4.1. Features with high cardinality

2.4.2. Features with low cardinality

2.5. Target distribution

So this is an imbalanced problem, with an approximate 84/16 class split. We will have to account for this imbalance before training a machine learning model.
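One common way to account for the imbalance, sketched below on the 84/16 split reported above: weight each class by its inverse frequency, which is what scikit-learn's class_weight="balanced" option computes internally.

```python
import numpy as np

# 84/16 split as in the article (1470 employees, ~237 leavers).
y = np.array([0] * 1233 + [1] * 237)

counts = np.bincount(y)
print(counts / counts.sum())  # class proportions, roughly [0.84, 0.16]

# Inverse-frequency class weights, equivalent to class_weight="balanced":
# n_samples / (n_classes * count_per_class).
weights = len(y) / (2 * counts)
print(dict(enumerate(np.round(weights, 2))))  # {0: 0.6, 1: 3.1}
```

Misclassifying a leaver then costs the model roughly five times as much as misclassifying an active employee, which counteracts the skewed class distribution.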

2.6. Correlation first analysis

As we can see, "Monthly Rate", "Number of Companies Worked" and "Distance From Home" are positively correlated with Attrition, while "Total Working Years", "Job Level", and "Years In Current Role" are negatively correlated with Attrition.
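The computation behind this reading can be sketched as follows, on a small invented frame (the values below are illustrative, not the real data): correlate each numeric feature with the binary Attrition flag and sort.

```python
import pandas as pd

# Toy data: leavers tend to live farther away and hold lower job levels.
df = pd.DataFrame({
    "Attrition":        [1, 1, 0, 0, 0, 0],
    "DistanceFromHome": [25, 20, 3, 5, 2, 4],
    "JobLevel":         [1, 1, 3, 4, 2, 5],
})

# Pearson correlation of every feature with the target, sorted ascending.
corr = df.corr()["Attrition"].drop("Attrition").sort_values()
print(corr)
```

With a 0/1 target this is the point-biserial correlation, so the sign tells us the direction of the relationship just as in the heatmap.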

3. EDA conclusions

4. Pre-processing the data

4.1. Encoding

Let's use the simple yet efficient label encoder. Overfitting may be a problem given our imbalanced data, so it's a good idea to avoid target-oriented encoding methods.
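A minimal sketch of label encoding with scikit-learn, shown on the BusinessTravel column (one of the categorical features in the IBM dataset); each category is mapped to an integer in alphabetical order of the class names:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"BusinessTravel": ["Travel_Rarely", "Travel_Frequently",
                                      "Non-Travel", "Travel_Rarely"]})

le = LabelEncoder()
df["BusinessTravel"] = le.fit_transform(df["BusinessTravel"])

print(list(le.classes_))  # ['Non-Travel', 'Travel_Frequently', 'Travel_Rarely']
print(df["BusinessTravel"].tolist())  # [2, 1, 0, 2]
```

Note that unlike target (mean) encoding, this mapping never looks at the label, so it cannot leak attrition information into the features.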

4.2. Transforming features

Let's see how the skewness is affected by three different variable transformations. The solid-color bars represent the skewness after the transformation.

Log:

Yeo-Johnson:

Box-Cox:

As shown above, the Yeo-Johnson transformer fits our dataset best. One important thing to keep in mind is to split the dataset before transforming, to avoid data leakage, since Yeo-Johnson learns its parameters from the data.
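A sketch of the leakage-safe pattern on synthetic skewed data (lognormal noise standing in for a skewed feature): fit the transformer on the training split only, then apply it to both splits.

```python
import numpy as np
from scipy.stats import skew
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X = rng.lognormal(size=(500, 1))  # heavily right-skewed feature

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Fit on the training split only: Yeo-Johnson estimates its lambda
# from the data, so fitting on the full set would leak test information.
pt = PowerTransformer(method="yeo-johnson")
X_train_t = pt.fit_transform(X_train)
X_test_t = pt.transform(X_test)

print(round(float(skew(X_train.ravel())), 2), "->",
      round(float(skew(X_train_t.ravel())), 2))
```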

4.3. Splitting the data

4.4. Scaling

As we can see, the skewness values have decreased, but the features aren't scaled yet. We're going to use min-max scaling, since it doesn't affect the skewness of the features.
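A quick sketch demonstrating why min-max scaling is safe here: it is a linear transformation, so it maps values into [0, 1] without changing the shape of the distribution or its skewness.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [10.0]])  # small right-skewed sample

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Values are mapped into [0, 1]...
print(X_scaled.ravel())
# ...but the skewness is unchanged, because the transformation is linear.
print(round(float(skew(X.ravel())), 4),
      round(float(skew(X_scaled.ravel())), 4))
```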

4.5. Feature selection

Let's recall our correlation matrix.

4.5.1. Correlation based elimination

Checking the new heatmap, we can see that the correlations with the target are quite low in our data.

4.5.2. P-value based elimination

Now, to select the final features for our model, we're going to rank them by the Fisher Score. In other words, we're keeping the features with the smallest p-values.

Now, let's drop the bottom 5 features. We can do it manually, or we can use scikit-learn's SelectKBest. Let's use the latter for simplicity.
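A sketch of the SelectKBest step on synthetic data (the real pipeline would use the preprocessed attrition features): keeping all but the 5 weakest features means k = n_features - 5, with the ANOVA F-test (f_classif) supplying the scores and p-values.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in: 20 features, only 5 of them informative.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# Keep the 15 highest-scoring features, i.e. drop the bottom 5.
selector = SelectKBest(score_func=f_classif, k=X.shape[1] - 5)
X_new = selector.fit_transform(X, y)

print(X.shape, "->", X_new.shape)  # (300, 20) -> (300, 15)
```

The fitted selector also exposes selector.pvalues_, so the dropped columns can be checked against the p-value ranking computed above.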

5. Building models

5.1. First evaluations

Let's do a first analysis without concerning ourselves with hyperparameters just to get a rough estimate of how different models work on our data.
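This first pass can be sketched as a loop of cross-validated scores over default-parameter models. The data below is synthetic (with a class imbalance mimicking the 84/16 split); the real run would use the preprocessed attrition features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in with an ~84/16 class split.
X, y = make_classification(n_samples=500, n_features=15, weights=[0.84],
                           random_state=0)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
}

# 5-fold cross-validation with default hyperparameters: a cheap first
# estimate of how each model family behaves on the data.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```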

5.2. Comparing the first results

Classification accuracy is the number of correct predictions as a ratio of all predictions made. It is the most common evaluation metric for classification problems. However, it is often misused: it is only really suitable when there is an equal number of observations in each class and all predictions and prediction errors are equally important. That is not the case in this project, so a different scoring metric may be more suitable.
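A small sketch of why accuracy misleads here: on an 84/16 split, a degenerate "model" that predicts no one ever leaves still scores 84% accuracy, while a class-sensitive metric such as F1 on the leaver class exposes it immediately.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# A "model" that predicts no one ever leaves, on an 84/16 split.
y_true = np.array([0] * 84 + [1] * 16)
y_pred = np.zeros(100, dtype=int)

print(accuracy_score(y_true, y_pred))                 # 0.84 -- looks fine...
print(f1_score(y_true, y_pred, zero_division=0))      # 0.0  -- ...but useless
```

This is why F1, recall on the positive class, or ROC-AUC are better scoring choices for this project than raw accuracy.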

Now, we can filter some models and tune the best of them.

5.3. Applying and evaluating the best models

5.3.1 Logistic Regression

5.3.2. Random Forest

Random Forest allows us to know which features are most important in predicting the target ("Attrition" in this project). Below, we plot the features by their importance.
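The plot's underlying values come from the fitted model's feature_importances_ attribute. A sketch on synthetic data (hypothetical feature names, not the real attrition columns):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           random_state=0)
cols = [f"feat_{i}" for i in range(5)]  # placeholder column names

rf = RandomForestClassifier(random_state=0).fit(X, y)

# Mean decrease in impurity per feature; the values sum to 1.
importances = (pd.Series(rf.feature_importances_, index=cols)
                 .sort_values(ascending=False))
print(importances)
```

These impurity-based importances can be biased toward high-cardinality features, so cross-checking the ranking with permutation importance is a reasonable extra step.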

Using the data from the feature importances, we can create a table to guide us in the final conclusions.

5.3.3. Support-Vector Machine

Notice that our SVM model predicted everything as 0, a common failure mode on imbalanced data. We're going to discard this model.

5.3.4. XGBoost

5.4. Voting Ensemble

So, since we ended up with three good models, we can combine them into an even better one using a voting ensemble.
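The combination step can be sketched with scikit-learn's VotingClassifier. The data is synthetic, and GradientBoostingClassifier stands in for XGBoost so the example stays within one library; soft voting averages the three models' predicted probabilities:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with an ~84/16 class split.
X, y = make_classification(n_samples=500, n_features=15, weights=[0.84],
                           random_state=0)

# Soft voting averages the predicted class probabilities of the members,
# so every estimator must implement predict_proba.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),  # XGBoost stand-in
    ],
    voting="soft",
)

scores = cross_val_score(ensemble, X, y, cv=5)
print(f"Voting ensemble: {scores.mean():.3f}")
```

Soft voting is usually preferable to hard voting when the members produce well-calibrated probabilities, since it lets a confident model outvote two uncertain ones.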

6. Conclusions

Let's recall our feature importance plot:

The strongest indicators of people leaving include: